Introduction to Bayesian Models

Steve Elston

01/13/2021

Review

The general process of hypothesis testing can be described as:

  1. Define a sampling plan and collect data following the plan
  2. Select the test statistic to be used
  3. Determine the null distribution for the null hypothesis H0
  4. Define a cutoff value, alpha, which determines the acceptable Type I error rate
  5. Compute the statistic for the data (observations)
  6. Given the value of the statistic, compute the p-value. The p-value is the area (integral) under the null distribution from the statistic to the limit of the distribution (+/- infinity); it represents the probability of a statistic as extreme or more extreme than the one observed
  7. Compare the p-value to the cutoff selected
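As an illustration of steps 5-7 (my sketch, not an example from the slides), a permutation test constructs the null distribution directly by shuffling group labels:

```python
import random

def permutation_test(a, b, n_perm=10_000, seed=42):
    """Two-sample permutation test on the difference of means.

    The permutation distribution plays the role of the null
    distribution; the returned p-value is the fraction of label
    shuffles whose statistic is as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a_p, b_p = pooled[:len(a)], pooled[len(a):]
        stat = abs(sum(a_p) / len(a_p) - sum(b_p) / len(b_p))
        if stat >= observed:
            extreme += 1
    return extreme / n_perm

# Widely separated samples give a small p-value
print(permutation_test([1, 2, 3, 4], [10, 11, 12, 13]))
```

The p-value is then compared to the chosen cutoff exactly as in step 7.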

Review

Compare the p-value to the cutoff selected

There are only two possible conclusions

  1. We reject the null hypothesis at confidence level alpha (the p-value falls below the cutoff)

  2. We cannot reject the null hypothesis since either, i) the null hypothesis is true and there is no difference between the data distribution and the null distribution, or ii) there is insufficient evidence given the effect size to reject the null hypothesis (insufficient power)

Beware of multiple hypothesis tests:

Review

For a one-tailed hypothesis test


Review

There are a great many tests to choose from

| Data | Hypothesis | Distribution | Test |
|------|------------|--------------|------|
| Two samples of a continuous variable | Difference of means | \(t\) | t-test |
| Counts for different categories | Counts are different | \(\chi^2\) | Pearson's \(\chi^2\) |
| Multiple groups of a continuous variable | Distributions of groups are different | \(F = \frac{between\ group\ variance}{within\ group\ variance}\) | ANOVA |
| Sample from a distribution | Does the variable have a given distribution? | Kolmogorov-Smirnov | Kolmogorov-Smirnov |

Introduction

Despite their long history, Bayesian models were not used extensively until recently

Introduction

Bayesian analysis is a contrast to frequentist methods

Bayesian Model Use Case

Bayesian methods made global headlines with the successful location of the missing Air France Flight 447

Posterior distribution of locations of Air France 447


Bayesian vs. Frequentist Views

With greater computational power and general acceptance, Bayes methods are now widely used

Bayesian vs. Frequentist Views

Can compare the contrasting frequentist and Bayesian approaches

Comparison of frequentist and Bayes methods


Review of Bayes Theorem

Bayes’ Theorem is fundamental to Bayesian data analysis.

\[P(A \cap B) = P(A|B) P(B) \]

We can also write:

\[P(A \cap B) = P(B|A) P(A) \]

Eliminating \(P(A \cap B):\)

\[ P(B)P(A|B) = P(A)P(B|A)\]

Or, Bayes theorem!

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]

Marginal Distributions

In many cases we are interested in the marginal distribution

\[p(\theta_1) = \int_{\theta_2, \ldots, \theta_n} p(\theta_1, \theta_2, \ldots, \theta_n)\ d\theta_2, \ldots, d\theta_n\]

- But computing this integral is not easy!

Marginal Distributions

\[ p(\theta) = \sum_{\mathbf{x} \in \mathbf{X}} p(\theta |\mathbf{x})\ p(\mathbf{x}) \]

\[ p(\mathbf{X}) = \sum_{\theta \in \Theta} p(\mathbf{X} |\theta) p(\theta) \]

Interpreting Bayes Theorem

How can you interpret Bayes’ Theorem?

\[Posterior\ Distribution = \frac{Likelihood \bullet Prior\ Distribution}{Evidence} \]

\[ Posterior\ distribution(parameters\ |\ data) = \\ \frac{Likelihood(data\ |\ parameters)\ Prior(parameters)}{P(data)} \]

\[ P(parameters\ |\ data) = \frac{P(data\ |\ parameters)\ P(parameters)}{P(data)} \]

Interpreting Bayes Theorem

What do these terms actually mean?

  1. Posterior distribution of the parameters given the evidence or data, the goal of Bayesian analysis

  2. Prior distribution is chosen to express information available about the model parameters apriori

  3. Likelihood is the conditional distribution of the data given the model parameters

  4. Data or evidence is the distribution of the data and normalizes the posterior

These relationships apply to all the parameters in a model: partial slopes, intercept, error distributions, lasso constants, etc.

Applying Bayes Theorem

We need a tractable formulation of Bayes Theorem for computational problems

\[ P(B \cap A) = P(B|A)P(A) \\ and \\ P(B) = P(B \cap A) + P(B \cap \bar{A}) \]

Where, \(\bar{A} = not\ A\), and the marginal distribution, \(P(B)\), can be written:

\[ P(B) = P(B|A)P(A) + P(B|\bar{A})P(\bar{A}) \]

Applying Bayes Theorem

Using the foregoing relations we can rewrite Bayes Theorem as:

\[ P(A|B) = \frac{P(A)P(B|A)}{P(B|A)P(A) + P(B|\bar{A})P(\bar{A})} \]

Rewrite Bayes Theorem as:

\[P(A|B) = k \cdot P(B|A)P(A)\]

Ignoring the normalization constant \(k\):

\[P(A|B) \propto P(B|A)P(A)\]

Interpreting Bayes Theorem

Denominator must account for all possible outcomes, or alternative hypotheses, \(h'\):

\[Posterior(hypothesis\ |\ evidence) =\\ \frac{Likelihood(evidence\ |\ hypothesis)\ prior(hypothesis)}{\sum_{ h' \in\ All\ possible\ hypotheses}Likelihood(evidence\ |\ h')\ prior(h')}\]

This is a formidable problem!

Bayes Theorem Example

Hemophilia is a serious genetic condition carried on the X chromosome

Bayes Theorem Example

As evidence the woman has two sons (not identical twins) with no expression of hemophilia

\[ p(x_1=0, x_2=0 | \theta = 1) = 0.5 * 0.5 = 0.25 \\ p(x_1=0, x_2=0 | \theta = 0) = 1.0 * 1.0 = 1.0 \]

Note: we are neglecting the possibility of a mutation in one of the sons

Bayes Theorem Example

Use Bayes theorem to compute the probability that the woman carries an X chromosome with the hemophilia gene, \(\theta = 1\)

\[ p(\theta=1 | X) = \frac{p(X|\theta=1) p(\theta=1)}{p(X|\theta=1) p(\theta=1) + p(X|\theta=0) p(\theta=0)} \\ = \frac{0.25 * 0.5}{0.25 * 0.5 + 1.0 * 0.5} = 0.20 \]

The evidence of two sons without hemophilia causes us to revise downward our belief in the probability that the woman carries the disease, from 0.5 to 0.2

Note: The denominator is the sum over all possible hypotheses, the marginal distribution of the observations \(\mathbf{X}\)
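The calculation above can be checked directly; a minimal Python sketch mirroring the slide's numbers:

```python
# Hemophilia carrier example: theta = 1 means the woman carries the
# affected X chromosome; the prior probability is 0.5.
prior = {1: 0.5, 0: 0.5}

# Likelihood of two unaffected sons given carrier status.
likelihood = {1: 0.5 * 0.5, 0: 1.0 * 1.0}

# Denominator: marginal probability of the data, summed over
# all hypotheses (the woman is or is not a carrier).
evidence = sum(likelihood[t] * prior[t] for t in prior)

posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}
print(posterior[1])  # 0.125 / 0.625 = 0.2
```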

Simplified Relationship for Bayes Theorem

How do we interpret the foregoing relationship?

Rewrite Bayes Theorem as:

\[P(A|B) = k \cdot P(B|A)P(A)\]

Ignoring the normalization constant \(k\):

\[P(A|B) \propto P(B|A)P(A)\]

\[Posterior\ Distribution \propto Likelihood \bullet Prior\ Distribution \\ Or \\ P(parameters\ |\ data) \propto P(data\ |\ parameters)\ P(parameters) \]

Creating Bayes models

The goal of a Bayesian analysis is computing and performing inference on the posterior distribution of the model parameters

The general steps are as follows:

  1. Identify data relevant to the research question

  2. Define a sampling plan for the data. Data need not be collected in a single batch

  3. Define the model and the likelihood function; e.g. regression model with Normal likelihood

  4. Specify a prior distribution of the model parameters

  5. Use the Bayesian inference formula to compute posterior distribution of the model parameters

  6. Update the posterior as data is observed

  7. Perform inference on the posterior; e.g. compute credible intervals

  8. Optionally, simulate data values from realizations of the posterior distribution. These values are predictions from the model.
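Steps 3-5 above can be sketched with a simple grid approximation (my illustration, assuming Bernoulli data and a uniform prior, not the course's code):

```python
# Grid approximation of the posterior for a Bernoulli parameter
# theta: posterior ∝ likelihood × prior, renormalized on the grid.
n_grid = 101
thetas = [i / (n_grid - 1) for i in range(n_grid)]

z, n = 2, 10  # observed successes and trials
# Flat prior (1.0 everywhere); Binomial-shaped likelihood.
unnorm = [(t ** z) * ((1 - t) ** (n - z)) * 1.0 for t in thetas]

total = sum(unnorm)
posterior = [u / total for u in unnorm]

map_theta = thetas[max(range(n_grid), key=lambda i: posterior[i])]
print(map_theta)  # with a flat prior the posterior mode equals z/n = 0.2
```

The grid can be re-normalized with each new batch of data, which is the updating in step 6.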

Updating Bayesian Models

An advantage of Bayesian models is that they can be updated as new observations are made

How can you choose a prior?

The choice of the prior is a difficult, and potentially vexing, problem when performing Bayesian analysis

How can you choose a prior?

How can we use prior empirical information to estimate the parameters of the prior distribution?

Conjugate Prior Distributions

An analytically and computationally simple choice for a prior distribution family is a conjugate prior

Conjugate Prior Distributions

Most commonly used distributions have conjugates, with a few examples:

| Likelihood | Conjugate prior |
|------------|-----------------|
| Binomial | Beta |
| Bernoulli | Beta |
| Poisson | Gamma |
| Categorical | Dirichlet |
| Normal - mean | Normal |
| Normal - variance | Inverse Gamma |
| Normal - inverse variance, \(\tau\) | Gamma |

Example using Conjugate Distribution

We are interested in analyzing the incidence of distracted drivers

\[ P(k) = \binom{n}{k} \cdot \theta^k(1-\theta)^{n-k}\]
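The Binomial likelihood above can be computed directly (a minimal sketch, not the course's code):

```python
from math import comb

def binomial_pmf(k, n, theta):
    """P(k successes in n trials): the Binomial likelihood."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# e.g. 2 distracted drivers out of 10 cars if theta = 0.2
print(binomial_pmf(2, 10, 0.2))  # ≈ 0.302
```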

Our process is:

  1. Use the conjugate prior, the Beta distribution with parameters \(\alpha\) and \(\beta\)
  2. Using the data sample, compute the likelihood
  3. Compute the posterior distribution of distracted driving
  4. Add more evidence (data) and update the posterior distribution.

Example using Conjugate Distribution

What are the properties of the Beta distribution?

Beta distribution for different parameter values


Example using Conjugate Distribution

Consider the product of a Binomial likelihood and a Beta prior

\[\begin{align} posterior(\theta | z, n) &= \frac{likelihood(z,n | \theta)\ prior(\theta)}{data\ distribution(z,n)} \\ p(\theta | z, n) &= \frac{Binomial(z,n | \theta)\ Beta(\theta | a, b)}{p(z,n)} \\ &= Beta(z + a,\ n - z + b) \end{align}\]

Example using Conjugate Distribution

There are some useful insights you can gain from this relationship:

\[ posterior(\theta | z, n) = Beta(z + a,\ n - z + b) \]

- Evidence is also in the form of (actual) counts of successes, \(z\), and failures, \(n-z\)
- The more evidence, the greater the influence on the posterior distribution
- A large amount of evidence will overwhelm the prior
- With a large amount of evidence, the posterior converges to the frequentist model

Example using Conjugate Distribution

Consider example with:
- Prior pseudo counts \([1,9]\): successes \(a = 1 + 1 = 2\) and failures \(b = 9 + 1 = 10\)
- Evidence: successes \(z = 2\) and failures \(n - z = 8\)
- Posterior is \(Beta(2 + 2,\ 10 + 8) = Beta(4,\ 18)\)

## Maximum of the prior density = 0.100
## Maximum likelihood 0.200
## MAP = 0.150
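The printed values above can be reproduced in a few lines of Python (a sketch assuming the standard Beta-Binomial conjugate update):

```python
def beta_binomial_update(a, b, z, n):
    """Conjugate update: a Beta(a, b) prior with z successes in n
    trials yields a Beta(a + z, b + n - z) posterior."""
    return a + z, b + n - z

def beta_mode(a, b):
    """Mode (MAP estimate) of Beta(a, b), valid for a, b > 1."""
    return (a - 1) / (a + b - 2)

a, b = 2, 10                      # prior: pseudo counts [1, 9] plus one each
a_post, b_post = beta_binomial_update(a, b, z=2, n=10)

print(beta_mode(a, b))            # prior mode: 0.1
print(2 / 10)                     # maximum likelihood: z / n = 0.2
print(beta_mode(a_post, b_post))  # MAP: 0.15
```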

Example using Conjugate Distribution

Adding additional evidence, same prior:
- Prior pseudo counts \([1,9]\): successes \(a = 2\) and failures \(b = 10\)
- Evidence: successes \(z = 10\) and failures \(n - z = 30\)
- Posterior is \(Beta(2 + 10,\ 10 + 30) = Beta(12,\ 40)\)

## Maximum of the prior density = 0.100
## Maximum likelihood 0.250
## MAP = 0.220

Credible Intervals

How can we specify the uncertainty for a Bayesian parameter estimate?

Credible Intervals


What are the 95% credible intervals for \(Beta(12,\ 40)\)?

Probability of distracted drivers for the next 10 cars
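Equal-tailed credible intervals can be estimated by sampling the posterior; a stdlib-only Python sketch (the \(Beta(12, 40)\) parameters assume the conjugate update from the distracted-driver example):

```python
import random

rng = random.Random(0)
# Draw from the posterior Beta(12, 40) and take the empirical
# 2.5% and 97.5% quantiles as a 95% credible interval.
samples = sorted(rng.betavariate(12, 40) for _ in range(100_000))
lower = samples[int(0.025 * len(samples))]
upper = samples[int(0.975 * len(samples))]
print(round(lower, 3), round(upper, 3))
```

Unlike a frequentist confidence interval, this interval contains the parameter with 95% posterior probability.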

Simulating from the posterior distribution: predictions

What else can we do with a Bayesian posterior distribution beyond credible intervals?

Simulating from the posterior distribution: predictions

Example: What are the probabilities of distracted drivers for the next 10 cars with posterior \(Beta(12,\ 40)\)?

Probability of distracted drivers for the next 10 cars

Intervals based on 1,000 trials

Chain rule of probability

Use the chain rule of probability to factor a joint distribution into a hierarchy of conditional distributions

\[P(A,B) = P(A|B)P(B)\]

\[P(A_1, A_2, A_3, A_4 \ldots, A_n) = P(A_1 | A_2, A_3, A_4, \ldots, A_n)\ P(A_2, A_3, A_4 \ldots, A_n)\]

\[P(A_1, A_2, A_3, A_4 \ldots, A_n) =\\ P(A_1 | A_2, A_3, A_4, \ldots, A_n)\ P(A_2 | A_3, A_4 \ldots, A_n)\\ P(A_3| A_4 \ldots, A_n) \ldots P(A_n)\]

Chain rule of probability

The factorization is not unique.

\[P(A_1, A_2, A_3, A_4 \ldots, A_n) =\\ P(A_n | A_{n-1}, A_{n-2}, A_{n-3}, \ldots, A_1)\ P(A_{n-1}| A_{n-2}, A_{n-3}, \ldots, A_1)\\ P(A_{n-2}| A_{n-3}, \ldots, A_1) \ldots p(A_1)\]

The billiards game

Roll a ball on a billiard table and mark its position length-wise (along one dimension only); subsequent rolls score a point for Bob when they land on Bob's side of the mark

What is the probability of Bob winning the game?
- The ball needs to land on Bob's side in each of the next three rolls
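A rejection-sampling sketch of the Bayesian answer (the slides do not state the score, so I assume the classic version of the problem in which Alice leads Bob 5 to 3):

```python
import random

rng = random.Random(1)
trials = 200_000
consistent = 0   # prior draws that reproduce the 5-3 score
bob_sweeps = 0   # ...in which Bob then wins three straight rolls

for _ in range(trials):
    p = rng.random()  # position of the mark: uniform prior on Alice's win probability
    alice = sum(rng.random() < p for _ in range(8))
    if alice == 5:    # keep only draws consistent with the observed score
        consistent += 1
        if all(rng.random() >= p for _ in range(3)):
            bob_sweeps += 1

# The Bayesian answer integrates over p: the exact value is
# 1/11 ≈ 0.091, versus the frequentist plug-in (3/8)**3 ≈ 0.053.
print(bob_sweeps / consistent)
```

The gap between the two answers is the point of the example: the frequentist estimate plugs in a single value of \(p\), while the Bayesian answer averages over the posterior uncertainty in \(p\).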


The frequentist approach

Let \(B\) be the event that Bob wins, \(D\) the data, or current score

The Bayesian approach

Summary

Bayesian analysis is a contrast to frequentist methods